RNASeqvsIsoSeq of Sample O23
RNASeqvsIsoSeq of Sample O23
- Step 1) IsoSeq Preparation: Annotate2Abundance
- Step 2) IsoSeq Preparation: SumFLCounts
- Step 3) RNASeq Preparation
- Step 4) Merge RNASeq and IsoSeq
- Step 5) Data Review for Full_Merge: RNASeq vs IsoSeq
- Step 6) Correlation of RawData
- Step 7) Correlation of log(Data)
- Step 8) Missing Reads
- Step 9) Correlation of only FSM transcripts
- Transcript Abundance vs Gencode (All Sqanti Classifications)
Objective: To tabulate the number of full-length reads obtained per gene from Isoseq and order genes from high to low, for comparison with RNAseq data for exact sample
Rationale: To evaluate whether Isoseq output comparable to RNAseq output
Analysis: 1. Downloaded raw subread.bam file from Sequel output 2. CCS and Isoseq3 command line (Lima, Cluster, Polish) 3. Mapped to mouse genome using GMAP 4. Tofu Cupcake 5. Sqanti for isoform characterisation
source("/gpfs/mrc0/projects/Research_Project-MRC148213/sl693/Scripts/IsoSeq3_Tg4510/RNASeqvsIsoSeq_2.R")
sample <- c("O23")Step 1) IsoSeq Preparation: Annotate2Abundance
Define function for Importing and Merging SQANTI classification file and TOFU abundance file
Input: Sqanti_Filter Classification Output file * All details of HQ-unique isoforms classified by assigning PacBio output gene Cluster ID to mouse gene name
Input: ToFU Abundance Output file * Quantification of number of Full_Length per PacBio_ID
Output: Merged txt file by PacBio ID * Merged txt file has the gene name by which the isoform belongs to (as identified by SQANTI) and the quantification of FL_counts (as quantified in TOFU) by PacBio ID
[1] "Input SQANTI Filter output file for Sample O23"
[1] "SQANTI Classification file of Sample O23"
Review of SQANTI Classification file
In each SQANTI Classification file, there are parameters that can further allow filtering:
- Classification of Isoform Types: antisense, full-splice_match, incomplete-splice_match, intergenic, novel_in_catalog, novel_not_in_catalog
- Full Splice Match - matches reference perfectly
- Incomplete Splice Match - matches reference partially
- Novel In Catalog - novel isoform using known junctions
Novel Not In Catalog - novel isofroms using novel junctions
For Sample O23, there are total 17167 transcripts, of which there are:
11690 Full-Splice Match Transcripts, 2156 Incomplete Splice Match, 2317 Novel In Catalog, 984 Novel Not In Catalog
- Reverse Transcription Switching: FALSE, TRUE
- TRUE - one of the junctions could be switching artifact; still retained after filtering as overruled if there are canonical junctions
FALSE - not switching artifact
For sample O23, there are 16096 Transcripts labelled as RTS FALSE, and 1071 transcripts labelled as RTS TRUE. 6.653827 of RTS reported. However, note, still 10 transcripts with TRUE RTS Stage but non-canonical junction.
- Coding of transcripts: coding, non_coding
- For sample O23, there are 15412 coding transcripts, and 1755 transcripts.
- Percentage of As in downstream in TTS: percent of genomic “A”s in the downstream 20 bp window. If this number if high (say > 0.8), the 3’ end site of this isoform is probably not reliable
[1] "Input SQANTI Filter output file for Sample O23"
[1] "/gpfs/mrc0/projects/Research_Project-MRC148213/sl693/WholeTranscriptome/Individual/ToFU/O23.collapsed.filtered.abundance.txt"
[1] "Abundance file of Sample O23"
[1] "Merged file of SQANTI Classification and Abundance File of Sample O23"
# IsoSeq Classification File with PacBio ID but no corresponding isoform ID in TOFU, but no read count?
Merge_IsoSeq[which(is.na(Merge_IsoSeq$norm_fl)),] isoform chrom strand length exons structural_category associated_gene
1986 PB.2.1 chr1 - 1049 1 novel_in_catalog Xkr4
associated_transcript ref_length ref_exons diff_to_TSS diff_to_TTS
1986 novel 3634 3 NA NA
diff_to_gene_TSS diff_to_gene_TTS subcategory
1986 50633 -405335 mono-exon_by_intron_retention
RTS_stage all_canonical min_sample_cov min_cov min_cov_pos sd_cov FL
1986 FALSE <NA> NA NA <NA> <NA> 2
n_indels n_indels_junc bite iso_exp gene_exp ratio_exp FSM_class
1986 28 NA NA NA NA NA B
coding ORF_length CDS_length CDS_start CDS_end perc_A_downstream_TTS
1986 coding 177 534 358 891 60.0
dist_to_cage_peak within_cage_peak polyA_motif polyA_dist count_fl
1986 NA False NA NA NA
count_nfl count_nfl_amb norm_fl norm_nfl norm_nfl_amb
1986 NA NA NA NA NA
[ reached 'max' / getOption("max.print") -- omitted 10 rows ]
Step 2) IsoSeq Preparation: SumFLCounts
Define function that the FL Counts for all transcripts per gene
Motivation: SQANTI Filter classification outputs one gene with multiple isoforms, thus complicates correlation with RNA-Seq Gene Expression Counts. PacBio FL count is presented per isoform rather than per gene. However, FeatureCount’s output from RNA-seq data is on a gene level. Therefore FL counts from IsoSeq needs to be summed for more convenient comparison: Total FL Counts of Transcripts per Gene from IsoSeq vs Raw Gene Counts from RNASeq
Alternative option: select only isoform with the highest number of FL counts, yet biased results especially given if many isoforms with similar or slgihtly smaller number of FL-counts. Assumptions: RNA-seq captures expression of all RNA transcripts irrespective of isoforms
Step 3) RNASeq Preparation
Input: FeatureCounts of all RNASeq samples (STAR Aligned to mm10 genome, and annotated to Gencode Mouse V20 gtf file) at gene level Output: FeatureCount of specific sample
[1] "Input FeatureCount for All Samples"
B21 C21 K17 K23 M21 O23 Q21 S23
ENSMUSG00000000001.4_Gnai3 761 565 374 523 582 375 418 410
ENSMUSG00000000003.15_Pbsn 0 0 0 0 0 0 0 0
ENSMUSG00000000028.15_Cdc45 19 20 20 25 32 24 24 24
ENSMUSG00000000031.16_H19 1 2 3 2 0 0 12 0
ENSMUSG00000000037.16_Scml2 7 14 12 6 12 7 4 15
ENSMUSG00000000049.11_Apoh 3 3 1 4 0 0 3 2
[1] "Input FeatureCount for Sample O23"
[1] "Validation of summing PacBio FL"
[1] "Original input data from ToFU Abundance files for the Gene App"
[1] "Summed PacBio FL count for the Gene App saved as new dataframe for downstream analysis"
[[1]]
associated_gene count_fl
5025 App 3
5026 App 3
5027 App 104
5028 App 2
5029 App 33
5030 App 3
5031 App 5
5032 App 2
5033 App 4
5034 App 3
[[2]]
# A tibble: 1 x 3
associated_gene PacBio_Isoform PacBio_FL_Counts
<chr> <fct> <int>
1 App PB.3566.1 162
Step 4) Merge RNASeq and IsoSeq
Input: Sample-specific Isoseq (Dataframe: Merge_IsoSeq_SumFL) and RNASeq (Dataframe: RNASeq) Counts Output: Dataframe “Full_Merge”: Merged Counts across IsoSeq and RNASeq by gene names
Also to call out specific counts of AD-associated genes, created function AD_Counts.
Full_Merge <- merge(Merge_IsoSeq_SumFL,RNASeq,by.x = "associated_gene",by.y = "Mgi_Symbol",all = TRUE)
head(Full_Merge) # output file name = Full_Merge associated_gene PacBio_Isoform PacBio_FL_Counts RNASeq O23 Raw Counts
1 0610005C13Rik <NA> NA 12
2 0610009B22Rik PB.1205.1 7 148
3 0610009E02Rik <NA> NA 9
4 0610009L18Rik <NA> NA 29
5 0610010F05Rik PB.1131.1 9 413
6 0610010K14Rik <NA> NA 14
associated_gene PacBio_Isoform PacBio_FL_Counts
1487 Apoe PB.7782.1 1881
1494 App PB.3566.1 162
10177 Mapt PB.1690.1 42
12935 Psen1 PB.2058.1 7
RNASeq O23 Raw Counts
1487 23649
1494 17898
10177 6759
12935 683
Step 5) Data Review for Full_Merge: RNASeq vs IsoSeq
Motivation: Within Full_Merge dataframe, interested to know which genes are detected only by IsoSeq, only by RNASeq, and alone. Also later downstream, able to plot the number of respective counts for these genes.
[1] "Total Number of Genes in Full_Merge of IsoSeq and RNASeq: 17969"
[1] "Total Number of Genes Detected in IsoSeq AND RNASeq: 8774"
[1] "Total Number of Genes Detected in IsoSeq but not RNASeq: 133"
[1] "Total Number of Genes Detected in RNASeq but not IsoSeq: 9062"
[1] "/gpfs/mrc0/projects/Research_Project-MRC148213/sl693/RNASeq/Correlations/O23_Full_Merge.csv"
Step 6) Correlation of RawData
output: Correlation of Gene Expression of IsoSeq FL Counts vs RNASeq Raw Counts. Correlation coefficient calculated from pearson’s method (assuming parametric) and considers
Pearson's product-moment correlation
data: Full_Merge$Isoseq_FL_Counts and Full_Merge$RNASeq_Raw_Counts
t = 73.255, df = 8772, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.6029289 0.6288985
sample estimates:
cor
0.6160811
Step 7) Correlation of log(Data)
motivation: As seen above, due to densely populated points of numbers with several extreme values, difficult to see plot. Thus, logged points for visual output: Correlation of Gene Expression of log(10)(IsoSeq FL Counts) vs log(10)(RNASeq Raw Counts). Note, correlation coefficient doesn’t change. However, as it is not possible to log 0, can only consider genes detected in both technology.
#Interactive_Log(Log_Full_Merge)
density_plot(Log_Full_Merge,"Log_Isoseq_FL_Counts","Log_RNASeq_Raw_Counts")Step 8) Missing Reads
input: Genes either detected by IsoSeq or RNASeq from Full_Merge dataframe (IsoSeq and RNASeq Counts/gene) output: Plot of those genes with its respective counts
[1] "Genes with no IsoSeq Reads but RNASeq RawCounts > 5000"
[1] "Ap2m1" "Apbb1" "Dnm1" "Dst" "Huwe1" "Kcnj10"
[7] "Mapk8ip3" "Shank1" "Slc12a5" "Stum" "Syngap1" "Syt7"
[13] "Unc13a" "Xist"
[1] "Genes with only IsoSeq Reads, and no RNASeq Reads"
[1] "4930509H03Rik" "4930578G10Rik" "A730089K16Rik" "A930015D03Rik"
[5] "Aarsd1" "AC110573.1" "AC121802.1" "AC124484.1"
[9] "AL731706.1" "B230362B09Rik" "B3gnt2" "C1qtnf5"
[13] "Ccdc22" "D630033A02Rik" "Entpd4" "Entpd4b"
[17] "Epo" "Exosc6" "Fam177a" "Fen1"
[21] "Galnt2" "Gemin4" "Gm10108" "Gm11518"
[25] "Gm13370" "Gm14440" "Gm15972" "Gm19409"
[29] "Gm20186" "Gm20388" "Gm20427" "Gm20458"
[33] "Gm20460" "Gm20634" "Gm20662" "Gm20683"
[37] "Gm20695" "Gm21969" "Gm21974" "Gm21988"
[41] "Gm26551" "Gm26561" "Gm26668" "Gm26786"
[45] "Gm26904" "Gm27029" "Gm28052" "Gm29232"
[49] "Gm29253" "Gm3002" "Gm3448" "Gm3591"
[53] "Gm38182" "Gm42416" "Gm42418" "Gm42420"
[57] "Gm42466" "Gm42936" "Gm44321" "Gm44503"
[61] "Gm45021" "Gm45140" "Gm45153" "Gm45213"
[65] "Gm45234" "Gm45837" "Gm47580" "Gm49032"
[69] "Gm49321" "Gm49354" "Gm49358" "Gpr25"
[73] "Gstp2" "Gtsf1" "H2-Ke6"
[ reached getOption("max.print") -- omitted 58 entries ]
Step 9) Correlation of only FSM transcripts
[1] "Merged file of SQANTI Classification and Abundance File of Sample O23"
[1] "Total Number of Genes in FSM_Full_Merge of IsoSeq and RNASeq: 17905"
[1] "Total Number of Genes Detected in IsoSeq AND RNASeq: 7711"
[1] "Total Number of Genes Detected in IsoSeq but not RNASeq: 69"
[1] "Total Number of Genes Detected in RNASeq but not IsoSeq: 10125"
Transcript Abundance vs Gencode (All Sqanti Classifications)
MERGE SQANTI and GENCODE input
Merge the tabulated number of isoforms from SQANTI filter classification txt and GENCODE based on the GeneName
gencode <- read.table("/gpfs/mrc0/projects/Research_Project-MRC148213/sl693/reference_2019/gencode.vM20_gene_annotation_table.txt", sep="\t")
Prepare_Gencode(gencode)[1] "Gencode vM20 Gene Annotation"
[1] "Gencode vM20 Gene Annotation with summed transcripts per gene"